An Efficient Algorithm for Outlier Detection in High Dimensional Real Databases
نویسندگان
چکیده
Detecting outlier patterns in data has been an important research topic in statistics, data mining and machine learning communities for many years. Research in identifying effective solutions to this problem have several interesting applications in a myriad of domains ranging from data cleaning to financial fraud detection and from network intrusion detection to clinical diagnosis of diseases. Among the different algorithms, statistical (parametric) approaches and distance-based outlier detection are the most popular in use. The former is well grounded but often has difficulty scaling to large and high dimensional data. The latter is relatively efficient and empirically found to be effective on a number of domains but scalability is still an issue in spite of a fair bit of research on the topic. To address this limitation, in this work, we propose Atalaia, an efficient and scalable distance-based algorithm for detecting outliers in large high dimensional databases. Central to our algorithm is a fast strategy to estimate the unusualness of a record within the database and use a rank-ordered approach to evaluate records. Our algorithm partitions the database and ranks the objects that are candidates to be an outlier, reducing significantly the number of comparisons among objects. We evaluate different ranking heuristics in a comprehensive set of real and synthetic databases. Further, Atalaia also handles heterogeneous databases, i.e, those containing both categorical and continuous attributes. The results show that our algorithm outperforms by up to 73% the state-of-the-art distance-based outlier detection algorithm.
منابع مشابه
Robust high-dimensional semiparametric regression using optimized differencing method applied to the vitamin B2 production data
Background and purpose: By evolving science, knowledge, and technology, we deal with high-dimensional data in which the number of predictors may considerably exceed the sample size. The main problems with high-dimensional data are the estimation of the coefficients and interpretation. For high-dimension problems, classical methods are not reliable because of a large number of predictor variable...
متن کاملDetecting High-Dimensional Outliers: the New Task, Algorithms and Performance
Outlier detection is a fundamental step in knowledge discovery in databases. With the increasing number of high-dimensional databases, existing outlier detection algorithms that work only in the context of full space are unable to effectively screen out informative outliers. This is because majority of these outliers exists only in subspaces. In this paper, we identify a new outlier detection t...
متن کاملOutlier detection for high dimensional data pdf
Is particularly useful for high dimensional data where outliers cannot be found.High dimensional data in Euclidean space pose special challenges to data. In about just the last few years, the task of unsupervised outlier detection has found.Outlier detection is an outstanding data mining task referred to open pdf with mac word class="text" href="https://tokiqivy.files.wordpress.com/2015/06/opel...
متن کاملApplication of Recursive Least Squares to Efficient Blunder Detection in Linear Models
In many geodetic applications a large number of observations are being measured to estimate the unknown parameters. The unbiasedness property of the estimated parameters is only ensured if there is no bias (e.g. systematic effect) or falsifying observations, which are also known as outliers. One of the most important steps towards obtaining a coherent analysis for the parameter estimation is th...
متن کاملOutlier Detection for Support Vector Machine using Minimum Covariance Determinant Estimator
The purpose of this paper is to identify the effective points on the performance of one of the important algorithm of data mining namely support vector machine. The final classification decision has been made based on the small portion of data called support vectors. So, existence of the atypical observations in the aforementioned points, will result in deviation from the correct decision. Thus...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2008